Tenuous Relations and Timeseries Analysis

Nathaniel Forde

Data Science @ Personio

Agenda

  1. Motivation: Background Problem
  2. Structural Autoregressive Models
  3. Structural Autoregressive Models in PyMC
  4. Vector Autoregressive Models
  5. Bayesian Vector Autoregressive Models in PyMC
  6. Hierarchical BVARs in PyMC

Motivation: Background Problem

Disclaimers and Acknowledgements

Note

No Personio data used in this presentation. The problem discussed motivated the research, but the models presented below will use fake or public data.

We will motivate the work with a discussion of app performance measurements and customer feedback metrics.

How do they influence one another, and how best can we distil the relationship between the two?

Note

Acknowledgements: The work on the BVAR discussed here took inspiration and borrowed heavily from the PyMC Labs blogpost here. The example PRs benefited from the feedback of the PyMC devs.

Warning

Not an Economist!!!

Systems and Feedback: App Performance

“Balancing feedback loops are equilibrating or goal seeking structures in systems and are both sources of stability and resistance to change”

“Allowing performance standards to be influenced by past performance… sets up a reinforcing feedback of eroding goals that sets a system drifting toward low performance.”

– Donella Meadows, Thinking in Systems

Disentangling Individual Effects

Identifying the Channel of Transmission

Lagging Effects in Time

Structural Autoregressive Models

Autoregressive Models

  • Regression: the linear projection of Y onto X is given by the OLS solution.
  • The Wold Decomposition: if a timeseries \(Y_{t}\) is covariance stationary then \(Y_{t}\) has a linear representation: \[Y_{t} = \mu + \sum_{j=0}^{\infty}b_{j}e_{t-j}\]
  • Auto-regression: extends this idea to the Hilbert space spanned by the infinite past history:
    • \[Y_{t} = \mu + \sum_{j=1}^{\infty} a_{j}Y_{t-j} + e_{t}\]
    • where the error terms \(e_{t}\) represent white noise \(N(0, \sigma^{2})\)
  • “…this justifies linear models as approximations” - Hansen, Econometrics

  • Most timeseries are shorter than their infinite past, so in practice we truncate to a finite number of lags (see the sketch below).
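
A minimal simulation sketch of a finite-lag AR(2) process with numpy; the parameter values here are illustrative assumptions, not estimates from the talk:

import numpy as np

rng = np.random.default_rng(42)

# Illustrative AR(2): Y_t = mu + a1*Y_{t-1} + a2*Y_{t-2} + e_t, with e_t ~ N(0, sigma^2)
mu, a1, a2, sigma = 0.0, 0.6, 0.3, 1.0  # a1 + a2 < 1 keeps this process stationary
T = 250

y = np.zeros(T)
for t in range(2, T):
    y[t] = mu + a1 * y[t - 1] + a2 * y[t - 2] + rng.normal(0, sigma)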

Moving Average Representation and Impulse Response

  • The coefficients in \(Y_{t} = \mu + \sum_{j=0}^{\infty}b_{j}e_{t-j}\) are known as the impulse response function (IRF), representing the change in \(Y_{t + j}\) due to a shock at time \(t\):

    • \(\dfrac{\partial}{\partial e_{t}} Y_{t + j} = b_{j}\)
  • They can be recovered from the AR representation by recursive reconstruction, as in the sketch below.

  • IRFs are often reported scaled by the standard deviation of the shocks.
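
The recursive reconstruction can be made concrete: for an AR(2), the MA coefficients satisfy \(b_{0} = 1\) and \(b_{j} = a_{1}b_{j-1} + a_{2}b_{j-2}\). A minimal sketch with illustrative coefficients:

import numpy as np

a = np.array([0.6, 0.3])  # illustrative AR(2) coefficients
horizon = 10

# b_0 = 1; b_j = sum_i a_i * b_{j-i}, treating b_j as 0 for j < 0
b = np.zeros(horizon + 1)
b[0] = 1.0
for j in range(1, horizon + 1):
    for i, a_i in enumerate(a, start=1):
        if j - i >= 0:
            b[j] += a_i * b[j - i]

print(b)  # b_j: the change in Y_{t+j} from a unit shock e_t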

Stationarity Requirements

Tricks to achieve stationarity

Conditions required for Stationarity

Structure as Story

“The Child is the father of the Man” - Wordsworth

Structural Autoregressive Models

  • Bayesian Structural Timeseries (BSTS) models are often seen as an alternative to auto-regressive models
  • But BSTS models can also incorporate latent auto-regressive components.
  • Benefits of going Bayesian:
    • Transparent understanding of the sources of uncertainty within your model, and the capacity to introduce outside information by way of priors or other covariates.
    • Expressive “lego-like” composability of the modelling structure matches the compositionality of thought.
    • Plausible counterfactual inference with posterior predictive distributions.

Characterising Structure

\[
Y \sim TruncNorm(\color{purple}{\mu, \sigma, 0, \infty}) \\
\color{purple}{\mu} = \color{green}{AR_{\mu}} + f(\color{red}{seasonality}) + f(\color{CornflowerBlue}{trend}) + ... \\
\color{purple}{\sigma} \sim InverseGamma(3, .5) \\
trend = \color{CornflowerBlue}{\alpha + \beta \cdot T_{i}} \\
\color{CornflowerBlue}{\alpha} \sim Normal(7, 1) \\
\color{CornflowerBlue}{\beta} \sim Normal(0.009, .1) \\
seasonality = \color{red}{FourierFeatures^{'}\beta_{fourier}} \\
\color{red}{\beta_{fourier}} \sim Normal(0, 0.5) \\
\color{green}{AR_{\mu}} \sim \color{green}{\mu_{ar} + a_{1}\cdot Y_{t-1} + a_{2} \cdot Y_{t-2} ... + \gamma}
\]

Structural Autoregressive Models in PyMC

Bayesian Structural Timeseries

with AR:
    ## Data containers to enable prediction
    t = pm.MutableData("t", t_data, dims="obs_id")
    y = pm.MutableData("y", ar_data, dims="obs_id")
    # AR priors: the first coefficient will be the intercept term
    coefs = pm.Normal("coefs", priors["coefs"]["mu"], priors["coefs"]["sigma"])
    sigma = pm.HalfNormal("sigma", priors["sigma"])
    ## Priors for the linear trend component
    alpha = pm.Normal("alpha", priors["alpha"]["mu"], priors["alpha"]["sigma"])
    beta = pm.Normal("beta", priors["beta"]["mu"], priors["beta"]["sigma"])
    ## Priors for seasonality
    beta_fourier = pm.Normal(
        "beta_fourier",
        mu=priors["beta_fourier"]["mu"],
        sigma=priors["beta_fourier"]["sigma"],
        dims="fourier_features",
    )

    # AR process: we need one init variable for each lag, hence size is variable too
    init = pm.Normal.dist(
        priors["init"]["mu"], priors["init"]["sigma"], size=priors["init"]["size"]
    )
    # Steps of the AR model minus the lags required given the specification
    ar1 = pm.AR(
        "ar",
        coefs,
        sigma=sigma,
        init_dist=init,
        constant=True,
        steps=t.shape[0] - (priors["coefs"]["size"] - 1),
        dims="obs_id",
    )
    # Trend component
    trend = pm.Deterministic("trend", alpha + beta * t, dims="obs_id")

    # Seasonality component
    fourier_terms = pm.MutableData("fourier_terms", ff)
    seasonality = pm.Deterministic(
        "seasonality", pm.math.dot(beta_fourier, fourier_terms), dims="obs_id"
    )
    ## Additive combination
    mu = ar1 + trend + seasonality

    # The likelihood
    outcome = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y, dims="obs_id")
    ## Sampling
    idata_ar = pm.sample_prior_predictive()
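
Fitting the model and generating the fits shown on the next slide then follows the usual PyMC workflow; a minimal sketch (the draw counts here are illustrative):

with AR:
    # Sample the posterior and append the posterior predictive draws
    idata_ar.extend(pm.sample(2000, tune=1000, target_accept=0.95))
    idata_ar.extend(pm.sample_posterior_predictive(idata_ar))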

Prior Predictive and Posterior Predictive Distributions

Autoregressive BSTS prior and posterior fits. For details see the PyMC docs here

Causal Inference with Interrupted Timeseries Designs

Causal Impact analysis; for more detail see CausalPy

Vector Autoregressive Models

Multivariate Timeseries

  • The multivariate linear projection solution applies, giving
    • \[ \mathbf{y}_{T} = \nu + A_{1}\mathbf{y}_{T-1} + A_{2}\mathbf{y}_{T-2} + \dots + A_{p}\mathbf{y}_{T-p} + \mathbf{e}_{T} \]

    • Wold Decomposition: \[ \mathbf{y}_{T} = \mu + \mathbf{e}_{T} + \Omega_{1}\mathbf{e}_{T-1} + \Omega_{2}\mathbf{e}_{T-2} + \dots \]

  • Multivariate auto-regressive representation with multivariate white-noise error terms (a small simulation sketch follows): \[ \begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T} = \nu + A_{1}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-1} + A_{2}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-2} + \dots + A_{p}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-p} + \mathbf{e}_{T} \]
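
A minimal simulation sketch of the recursion above, with numpy; the VAR(2) coefficient matrices here are illustrative assumptions, not estimates from the talk:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative VAR(2) on two series (e.g. gdp and cons); A1, A2 chosen to keep the system stationary
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
A2 = np.array([[0.1, 0.0],
               [0.0, 0.1]])
nu = np.zeros(2)
T = 300

y = np.zeros((T, 2))
for t in range(2, T):
    # y_t = nu + A1 @ y_{t-1} + A2 @ y_{t-2} + e_t
    y[t] = nu + A1 @ y[t - 1] + A2 @ y[t - 2] + rng.normal(0, 1, size=2)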

VAR(2)

A VAR can have n lags and m equations, such that the lagged terms of each series appear in every equation.

\[ gdp_{t} = \beta^{gdp}_{gdp,1} \cdot gdp_{t-1} + \beta^{gdp}_{gdp,2} \cdot gdp_{t-2} + \color{red}{\beta^{gdp}_{cons,1}} \cdot cons_{t-1} + \color{red}{\beta^{gdp}_{cons,2}} \cdot cons_{t-2} + \epsilon_{gdp}\] \[ cons_{t} = \beta^{cons}_{cons,1} \cdot cons_{t-1} + \beta^{cons}_{cons,2} \cdot cons_{t-2} + \color{red}{\beta^{cons}_{gdp,1}} \cdot gdp_{t-1} + \color{red}{\beta^{cons}_{gdp,2}} \cdot gdp_{t-2} + \epsilon_{cons}\]

As such, the coefficients on the cross-equation variables (in red) are of particular interest when investigating the influence of one series on another.

The multivariate IRF can be constructed analogously to the univariate case, giving an interpretation of, e.g., the change in \(gdp_{t}\) due to a shock in \(cons_{t-2}\).

This opens up the possibility of testing how influence between variables percolates over time.

Impulse Response

Statsmodels implementation of IRF
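
The statsmodels version can be reproduced in a few lines; a minimal sketch, assuming df is a DataFrame holding the two series:

from statsmodels.tsa.api import VAR

# df: DataFrame with columns such as "gdp" and "cons" (assumed here)
model = VAR(df)
results = model.fit(2)   # fit a VAR(2)
irf = results.irf(10)    # impulse responses out to 10 periods
irf.plot(orth=True)      # orthogonalised IRFs, scaled by shock standard deviation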

Vector Autoregressive Models in PyMC

Arranging the Coefficient Matrices

  • There are a lot of indices and variables in a VAR model.
  • They can be concisely expressed in matrix notation, but it’s not easy to keep track of all the indices.
  • Good Bayesian workflow abstracts the component parts of the model definition, as in the helper function below.
### Define a helper function that constructs the autoregressive step: the marginal contribution
### of each lagged term in each of the respective timeseries equations
def calc_ar_step(lag_coefs, n_eqs, n_lags, df):
    ars = []
    for j in range(n_eqs):
        # Sum the contribution of each lag to equation j
        ar = pm.math.sum(
            [
                pm.math.sum(lag_coefs[j, i] * df.values[n_lags - (i + 1) : -(i + 1)], axis=-1)
                for i in range(n_lags)
            ],
            axis=0,
        )
        ars.append(ar)
    # Stack the per-equation contributions into a (time, equations) tensor
    betaX = pm.math.stack(ars, axis=-1)
    return betaX
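
The dims used in the model below (“equations”, “lags”, “cross_vars”, “time”) must be registered as model coordinates. A plausible sketch of that coords dictionary, inferred from those dims (the exact construction here is an assumption, with n_lags and df as defined above):

import numpy as np

coords = {
    "lags": np.arange(n_lags) + 1,        # 1, 2, ..., n_lags
    "equations": df.columns.tolist(),     # one equation per series
    "cross_vars": df.columns.tolist(),    # lagged terms of every series enter each equation
    "time": np.arange(len(df) - n_lags),  # observations remaining after conditioning on the lags
}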

The Broad Structure

VAR(2)

Implementation in Code

with pm.Model(coords=coords) as model:
    lag_coefs = pm.Normal(
        "lag_coefs",
        mu=priors["lag_coefs"]["mu"],
        sigma=priors["lag_coefs"]["sigma"],
        dims=["equations", "lags", "cross_vars"],
    )
    alpha = pm.Normal(
        "alpha", mu=priors["alpha"]["mu"], sigma=priors["alpha"]["sigma"], dims=("equations",)
    )
    data_obs = pm.Data("data_obs", df.values[n_lags:], dims=["time", "equations"], mutable=True)

    # The stacked autoregressive contributions, shape (time, equations)
    betaX = calc_ar_step(lag_coefs, n_eqs, n_lags, df)
    betaX = pm.Deterministic("betaX", betaX, dims=["time", "equations"])
    mean = alpha + betaX

    if mv_norm:
        n = df.shape[1]
        ## Under the hood the LKJ prior will retain the correlation matrix too.
        noise_chol, _, _ = pm.LKJCholeskyCov(
            "noise_chol",
            eta=priors["noise_chol"]["eta"],
            n=n,
            sd_dist=pm.HalfNormal.dist(sigma=priors["noise_chol"]["sigma"]),
        )
        obs = pm.MvNormal(
            "obs", mu=mean, chol=noise_chol, observed=data_obs, dims=["time", "equations"]
        )
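
When mv_norm is False one could presumably fall back to independent Normal likelihoods per equation, giving up the correlated error structure. Sampling then proceeds as usual; a minimal sketch (draw counts are illustrative):

with model:
    idata = pm.sample(2000, tune=1000, target_accept=0.95)
    idata.extend(pm.sample_posterior_predictive(idata))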

Applying the Theory

Macroeconomic Data

The Special Case of Ireland

Ireland’s Posterior Predictive Checks

Table 1: Posterior Mean Correlation

         GDP    CONS
GDP      1.00   0.43
CONS     0.43   1.00

Comparison with Statsmodels MLE fit

Comparison of the Bayesian posterior estimates with the statsmodels MLE fit

Convergence Checks

Sampling Chains

Hierarchical Bayesian VARs

“Those who only know one country, know no country” – Seymour Martin Lipset

Pooling, Shrinkage and Hierarchy

Shrinkage to Centre Effect

How to model Hierarchical Structure

groups = df[group_field].unique()

with pm.Model(coords=coords) as model:
    ## Hierarchical priors
    rho = pm.Beta("rho", alpha=2, beta=2)
    alpha_hat_location = pm.Normal("alpha_hat_location", 0, 0.1)
    alpha_hat_scale = pm.InverseGamma("alpha_hat_scale", 3, 0.5)
    beta_hat_location = pm.Normal("beta_hat_location", 0, 0.1)
    beta_hat_scale = pm.InverseGamma("beta_hat_scale", 3, 0.5)
    omega_global, _, _ = pm.LKJCholeskyCov(
        "omega_global", n=n_eqs, eta=1.0, sd_dist=pm.Exponential.dist(1)
    )
    ## Group-specific parameters
    for grp in groups:
        df_grp = df[df[group_field] == grp][cols]
        z_scale_beta = pm.InverseGamma(f"z_scale_beta_{grp}", 3, 0.5)
        z_scale_alpha = pm.InverseGamma(f"z_scale_alpha_{grp}", 3, 0.5)
        lag_coefs = pm.Normal(
            f"lag_coefs_{grp}",
            mu=beta_hat_location,
            # Hierarchical offset on the scale
            sigma=beta_hat_scale * z_scale_beta,
            dims=["equations", "lags", "cross_vars"],
        )
        alpha = pm.Normal(
            f"alpha_{grp}",
            mu=alpha_hat_location,
            # Hierarchical offset on the scale
            sigma=alpha_hat_scale * z_scale_alpha,
            dims=("equations",),
        )

        betaX = calc_ar_step(lag_coefs, n_eqs, n_lags, df_grp)
        betaX = pm.Deterministic(f"betaX_{grp}", betaX)
        mean = alpha + betaX

        n = df_grp.shape[1]
        noise_chol, _, _ = pm.LKJCholeskyCov(
            f"noise_chol_{grp}", eta=10, n=n, sd_dist=pm.Exponential.dist(1)
        )
        # Convex combination of the global and group-specific Cholesky factors
        omega = pm.Deterministic(f"omega_{grp}", rho * omega_global + (1 - rho) * noise_chol)
        obs = pm.MvNormal(f"obs_{grp}", mu=mean, chol=omega, observed=df_grp.values[n_lags:])
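
A design note on the omega line above: the convex combination \(\rho \cdot \Omega_{global} + (1 - \rho) \cdot \Omega_{grp}\) partially pools the error covariance structure. As \(\rho \rightarrow 1\) every country shares the global error covariance; as \(\rho \rightarrow 0\) each country's noise structure is estimated independently, with \(\rho\) itself learned from the data.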

Ireland is an Outlier

Country Specific Estimates

The Global Correlation Structure

Global Correlations

The Posterior Predictive Fits

Posterior Predictive Fits by country. See PYMC docs here

Wrap Up

  • Bayesian timeseries models are flexible and transparent tools for forecasting and causal inference.
  • VAR models can help interrogate questions about the relationships between timeseries.
  • Bayesian VAR models allow the specification of regularising priors and encode structural assumptions about the direction of influence.
  • Hierarchical Bayesian VAR models offer the chance to borrow information across groups and to interrogate questions about the relationships between timeseries within and across those groups.
  • In contexts with sparse or short timeseries, the hierarchical model can augment our understanding of the lagged relationships of influence.

For more Bayesian Content

PyMCon